# TCGA-BRCA Preprocessed Multi-Omics Dataset
## README – Version 1.0 (November 2025)

## 1. Overview

This repository provides a fully preprocessed, analysis-ready multi-omics dataset derived from The Cancer Genome Atlas – Breast Invasive Carcinoma (TCGA-BRCA) cohort.

The dataset includes harmonized and batch-corrected RNA-seq, DNA methylation, Copy Number Variation (CNV), and clinical data for 710 patients across 16,163 genes.

This resource is intended for machine learning, multi-omics integration, survival modeling, and graph neural networks.

## 2. Dataset Contents

| File | Description |
|------|-------------|
| rna_FINAL.csv | Log₂(TPM+1) RNA-seq expression matrix (genes × patients) |
| meth_FINAL.csv | Promoter-level DNA methylation beta values (genes × patients) |
| cnv_FINAL.csv | GISTIC2 categorical CNV scores (genes × patients) |
| clinical_data.csv | Clinical and survival information (patients × variables) |
| Supplementary_Information.pdf | Figures, validation metrics, QC summaries |
| README.md | This document |

## 3. Data Files Description

### 1. **rna_FINAL.csv** (RNA-seq Gene Expression)
**Description:** Transcriptomic data measuring gene expression for every patient. Values are log₂(TPM + 1), where TPM is Transcripts Per Million and 1 is the pseudocount. They have been batch-corrected with ComBat and quantile-normalized for cross-sample consistency.

**Format:**
- **Rows:** Gene symbols (16,163 genes)
- **Columns:** Patient TCGA barcodes (710 patients)
- **Values:** log₂(TPM+1), continuous scale [0, ~17]
- **Interpretation:** Higher values = higher gene expression

**Example:**
```
Gene      | TCGA-A1-A0SD | TCGA-A1-A0SE | TCGA-A1-A0SF | ...
----------|--------------|--------------|--------------|----
TP53      | 8.24         | 7.91         | 9.12         | ...
ERBB2     | 6.73         | 11.45        | 5.89         | ...
ESR1      | 9.87         | 2.34         | 10.12        | ...
```

---

### 2. **meth_FINAL.csv** (DNA Methylation)
**Description:** Epigenomic data measuring DNA methylation at gene promoter regions. Values are beta values (0–1 scale) averaged over all CpG probes in each gene's promoter region, defined as 1,500 bp upstream to 500 bp downstream of the transcription start site (TSS). They have been batch-corrected and quantile-normalized.

**Format:**
- **Rows:** Gene symbols (16,163 genes)
- **Columns:** Patient TCGA barcodes (710 patients)
- **Values:** Beta values, continuous scale [0, 1]
- **Interpretation:** 0 = unmethylated, 1 = fully methylated; higher methylation typically correlates with lower gene expression

**Example:**
```
Gene      | TCGA-A1-A0SD | TCGA-A1-A0SE | TCGA-A1-A0SF | ...
----------|--------------|--------------|--------------|----
TP53      | 0.123        | 0.087        | 0.156        | ...
ERBB2     | 0.445        | 0.234        | 0.567        | ...
ESR1      | 0.089        | 0.723        | 0.045        | ...
```

---

### 3. **cnv_FINAL.csv** (Copy Number Variation)
**Description:** Genomic data indicating gene-level copy number alterations. Values are GISTIC2 categorical scores representing discrete copy number states relative to normal diploid (2 copies).

**Format:**
- **Rows:** Gene symbols (16,163 genes)
- **Columns:** Patient TCGA barcodes (710 patients)
- **Values:** Integer scores [-2, -1, 0, +1, +2]
- **Interpretation** (copy counts are the conventional, approximate reading of each GISTIC2 category):
  - **-2:** Homozygous deletion (0 copies)
  - **-1:** Heterozygous deletion (1 copy)
  - **0:** Neutral/diploid (2 copies)
  - **+1:** Low-level gain (3-4 copies)
  - **+2:** High-level amplification (≥5 copies)

**Example:**
```
Gene      | TCGA-A1-A0SD | TCGA-A1-A0SE | TCGA-A1-A0SF | ...
----------|--------------|--------------|--------------|----
TP53      | 0            | -1           | 0            | ...
ERBB2     | +2           | 0            | +1           | ...
ESR1      | 0            | 0            | -1           | ...
```
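
The integer codes can be mapped to readable labels or summarized as per-gene alteration frequencies. The following is a minimal sketch on a toy matrix, not part of the dataset: the `GISTIC_LABELS` mapping, the `alteration_frequency` helper, and the patient names `P1`–`P4` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical label names for the five GISTIC2 categories described above.
GISTIC_LABELS = {
    -2: "homozygous_deletion",
    -1: "heterozygous_deletion",
    0: "neutral",
    1: "low_level_gain",
    2: "high_level_amplification",
}

def alteration_frequency(cnv: pd.DataFrame) -> pd.DataFrame:
    """Return, per gene, the fraction of patients in each GISTIC2 category."""
    freq = {
        label: (cnv == code).mean(axis=1)  # share of patients with this code
        for code, label in GISTIC_LABELS.items()
    }
    return pd.DataFrame(freq)

# Toy genes x patients matrix (values invented for illustration only).
toy_cnv = pd.DataFrame(
    {"P1": [0, 2, 0], "P2": [-1, 0, 0], "P3": [0, 1, -1], "P4": [0, 2, 0]},
    index=["TP53", "ERBB2", "ESR1"],
)
freq = alteration_frequency(toy_cnv)
print(freq)
```

With the real file, the same function could be applied to the DataFrame returned by `pd.read_csv("cnv_FINAL.csv", index_col=0)`.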

---

### 4. **clinical_data.csv** (Clinical and Survival Information)
**Description:** Clinical variables and survival outcomes for each patient, aligned with multi-omics data.

**Format:**
- **Rows:** Patients (710 patients)
- **Columns:** Clinical variables
- **Key Variables:**
  - `Patient_ID`: TCGA barcode (matches omics column names)
  - `survival_time`: Time to event or censoring (days)
  - `survival_status`: Event indicator (0=censored, 1=death)
  - `age_at_diagnosis`: Patient age (years)
  - `gender`: Patient sex
  - `tumor_stage`: Pathological stage
  - `primary_diagnosis`: Histological diagnosis

**Example:**
```
Patient_ID    | survival_time | survival_status | age_at_diagnosis | tumor_stage
--------------|---------------|-----------------|------------------|-------------
TCGA-A1-A0SD | 1825          | 0               | 58               | Stage II
TCGA-A1-A0SE | 912           | 1               | 62               | Stage III
TCGA-A1-A0SF | 2134          | 0               | 51               | Stage I
```
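
The `survival_time` and `survival_status` columns together describe right-censored survival data. As a hedged illustration of how they are typically used, here is a small Kaplan-Meier estimator written with pandas only. The function name `kaplan_meier` and the toy values are assumptions for this sketch; for real analyses a dedicated survival library is likely preferable.

```python
import pandas as pd

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate S(t) at each distinct event time.

    time  : follow-up times (e.g. the survival_time column, in days)
    event : 1 if death was observed, 0 if censored (survival_status)
    """
    df = pd.DataFrame({"time": time, "event": event})
    # Per distinct time: number of deaths and number of subjects leaving
    # the risk set (deaths plus censorings). groupby sorts times ascending.
    grouped = df.groupby("time")["event"].agg(deaths="sum", leaving="count")
    # Subjects still at risk just before each time point.
    at_risk = len(df) - grouped["leaving"].cumsum().shift(1, fill_value=0)
    # Product-limit estimator: multiply conditional survival probabilities.
    surv = (1.0 - grouped["deaths"] / at_risk).cumprod()
    # Report the curve only at times where a death occurred.
    return surv[grouped["deaths"] > 0]

# Toy data, invented for illustration only.
km = kaplan_meier(time=[5, 8, 8, 12, 15], event=[1, 1, 0, 0, 1])
print(km)
```

On the real table this would be called as `kaplan_meier(clinical["survival_time"], clinical["survival_status"])`.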

---


## 4. Preprocessing Workflow

All omics layers were processed using a standardized pipeline:

1. **Patient Alignment**
   - Intersection of patients across RNA-seq, methylation, and CNV
   - Final cohort: 710 patients

2. **Gene Filtering**
   - Removal of low-variance / low-quality genes
   - Final gene count: 16,163

3. **Normalization**
   - RNA-seq: log₂(TPM+1)
   - Methylation: beta values aggregated at promoter
   - CNV: GISTIC2 calls

4. **Batch Effect Correction**
   - ComBat (parametric)

5. **Missing Value Handling**
   - KNN imputation (k = 5) applied to the <0.5% of values that were missing
   - Final dataset: 0% missing values

6. **Quality Control**
   - 38 samples removed
   - QC flags recorded in supplementary files
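
The original pipeline code is not included in this repository. The sketch below only illustrates what steps 1 to 3 could look like for the RNA-seq layer, under stated assumptions: the helper names (`align_patients`, `log2_tpm`, `drop_low_variance`) and the `min_var` threshold are hypothetical, and ComBat correction and KNN imputation are not shown.

```python
import numpy as np
import pandas as pd

def align_patients(*layers):
    """Keep only patients (columns) present in every layer, in a shared order."""
    shared = sorted(set.intersection(*(set(df.columns) for df in layers)))
    return [df.loc[:, shared] for df in layers]

def log2_tpm(tpm):
    """Apply the log2(TPM + 1) transform element-wise."""
    return np.log2(tpm + 1.0)

def drop_low_variance(expr, min_var):
    """Drop genes whose variance across patients is below min_var.

    The threshold is an assumption; the README does not state the one used.
    """
    return expr.loc[expr.var(axis=1) >= min_var]

# Toy genes x patients tables, invented for illustration only.
tpm = pd.DataFrame(
    {"P1": [0.0, 3.0], "P2": [1.0, 3.0], "P3": [7.0, 3.0]}, index=["G1", "G2"]
)
meth = pd.DataFrame({"P3": [0.2, 0.4], "P1": [0.1, 0.9]}, index=["G1", "G2"])

tpm_aligned, meth_aligned = align_patients(tpm, meth)   # P2 is dropped
expr = drop_low_variance(log2_tpm(tpm_aligned), min_var=1e-6)  # G2 is constant
print(expr)
```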

## 5. Data Integrity and Validation

### 5.1 Dimensions

- RNA-seq: 16,163 × 710
- Methylation: 16,163 × 710
- CNV: 16,163 × 710
- Clinical: 710 × variables

### 5.2 Value Ranges

- RNA-seq: log₂(TPM+1), 0–17
- Methylation: 0–1
- CNV: integer categories (−2 to +2)
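
The dimension and value-range statements above can be checked programmatically after loading. The following is a minimal sketch; `check_layers` is a hypothetical helper, and the toy matrices stand in for the real files.

```python
import numpy as np
import pandas as pd

def check_layers(rna, meth, cnv):
    """Return a list of detected problems; an empty list means all checks passed."""
    problems = []
    if not (rna.shape == meth.shape == cnv.shape):
        problems.append("layers have different shapes")
    if (rna.values < 0).any():
        problems.append("negative expression values")
    if ((meth.values < 0) | (meth.values > 1)).any():
        problems.append("beta values outside [0, 1]")
    if not np.isin(cnv.values, [-2, -1, 0, 1, 2]).all():
        problems.append("unexpected CNV codes")
    for name, df in [("rna", rna), ("meth", meth), ("cnv", cnv)]:
        if df.isna().any().any():
            problems.append(f"{name} contains missing values")
    return problems

# Toy genes x patients matrices, invented for illustration only.
genes, patients = ["TP53", "ERBB2"], ["P1", "P2"]
rna = pd.DataFrame([[8.2, 7.9], [6.7, 11.4]], index=genes, columns=patients)
meth = pd.DataFrame([[0.12, 0.09], [0.45, 0.23]], index=genes, columns=patients)
cnv = pd.DataFrame([[0, -1], [2, 0]], index=genes, columns=patients)

print(check_layers(rna, meth, cnv))                 # []
print(check_layers(rna, meth, cnv.replace(2, 3)))   # ['unexpected CNV codes']
```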

### 5.3 Cross-Omics Consistency (Biological Validation)

- Expression–CNV: positive correlation
- Expression–Methylation: negative correlation
- CNV–Methylation: minimal correlation
- Expected patterns observed in ERBB2, TP53, ESR1, PIK3CA
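
One way to reproduce this kind of check is a per-gene rank correlation between two layers across patients. The sketch below computes Spearman's rho row by row with pandas and NumPy. The function name `rowwise_spearman` and the toy values are assumptions, and this is not necessarily the statistic the authors used.

```python
import numpy as np
import pandas as pd

def rowwise_spearman(a: pd.DataFrame, b: pd.DataFrame) -> pd.Series:
    """Spearman correlation between matching rows (genes) of two layers."""
    # Rank each gene's values across patients, with b aligned to a.
    ra = a.rank(axis=1)
    rb = b.loc[a.index, a.columns].rank(axis=1)
    # Pearson correlation of the ranks equals Spearman's rho.
    ra = ra.sub(ra.mean(axis=1), axis=0)
    rb = rb.sub(rb.mean(axis=1), axis=0)
    num = (ra * rb).sum(axis=1)
    den = np.sqrt((ra ** 2).sum(axis=1) * (rb ** 2).sum(axis=1))
    return num / den  # NaN for genes that are constant in either layer

# Toy genes x patients matrices, invented for illustration only.
patients = ["P1", "P2", "P3", "P4"]
expr = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 4]], index=["G1", "G2"], columns=patients)
meth = pd.DataFrame(
    [[0.9, 0.7, 0.5, 0.1], [0.1, 0.2, 0.4, 0.3]], index=["G1", "G2"], columns=patients
)
rho = rowwise_spearman(expr, meth)
print(rho)
```

Applied to the real data, `rowwise_spearman(rna, meth)` would give one coefficient per gene, whose distribution can then be inspected.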

## 6. Usage Guide

### 6.1 Load Data in Python

```python
import pandas as pd

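# Omics matrices: gene symbols as the row index, patient barcodes as columns
rna = pd.read_csv("rna_FINAL.csv", index_col=0)
meth = pd.read_csv("meth_FINAL.csv", index_col=0)
cnv = pd.read_csv("cnv_FINAL.csv", index_col=0)
# Clinical table: one row per patient, keyed by the Patient_ID column
clinical = pd.read_csv("clinical_data.csv")

# Verify that all layers list the same patients in the same order
assert list(rna.columns) == list(meth.columns) == list(cnv.columns)
assert list(rna.columns) == list(clinical["Patient_ID"])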
```

### 6.2 Quick Start for ML

```python
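# Patients as rows, genes as columns: shape (710, 16163)
X = rna.T.values
# Look labels up by patient ID so they follow the column order of rna
y = clinical.set_index("Patient_ID").loc[rna.columns, "survival_status"].values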
```
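
For multi-omics models, the three layers are often combined into a single patient × feature matrix. The following is a minimal sketch of simple feature concatenation. The `stack_layers` helper and the `gene__layer` column naming are illustrative assumptions, and the toy matrices stand in for the real files.

```python
import pandas as pd

def stack_layers(layers):
    """Concatenate genes x patients layers into one patients x features table.

    layers : dict mapping a layer name (e.g. "rna") to its DataFrame
    """
    # Use the patient order of the first layer for all blocks.
    patients = next(iter(layers.values())).columns
    blocks = []
    for name, df in layers.items():
        block = df.loc[:, patients].T  # patients as rows
        block.columns = [f"{gene}__{name}" for gene in block.columns]
        blocks.append(block)
    return pd.concat(blocks, axis=1)

# Toy genes x patients matrices, invented for illustration only.
genes, patients = ["TP53", "ERBB2"], ["P1", "P2", "P3"]
rna = pd.DataFrame([[8.2, 7.9, 9.1], [6.7, 11.4, 5.9]], index=genes, columns=patients)
meth = pd.DataFrame([[0.12, 0.09, 0.16], [0.45, 0.23, 0.57]], index=genes, columns=patients)
cnv = pd.DataFrame([[0, -1, 0], [2, 0, 1]], index=genes, columns=patients)

features = stack_layers({"rna": rna, "meth": meth, "cnv": cnv})
print(features.shape)  # (3, 6)
```

Because the CNV layer is categorical, some models may benefit from one-hot encoding those columns instead of treating the codes as numeric.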

## 7. Recommended Applications

- Multi-omics machine learning
- Survival analysis
- Molecular subtyping
- GNN-based biomarker discovery
- Dimensionality reduction and integrative modeling

## 8. Data Source and Citation

**Primary Source:**

The Cancer Genome Atlas (TCGA) – Breast Invasive Carcinoma (BRCA)

**If using this dataset, cite the original TCGA publication:**

The Cancer Genome Atlas Network. (2012). Comprehensive molecular portraits of human breast tumours. Nature, 490(7418), 61–70.

## 9. Notes

- The dataset is fully processed and ready for downstream analysis.
- No additional normalization or batch correction is required.
- Patient identifiers are the first 12 characters of the full TCGA barcode (for example, `TCGA-A1-A0SD`).
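
When joining this dataset with other TCGA resources that use longer sample-level or aliquot-level barcodes, those barcodes need to be truncated to the patient level first. A minimal sketch follows; `to_patient_id` is a hypothetical helper, and the barcode suffix in the example is made up for illustration.

```python
def to_patient_id(barcode: str) -> str:
    """Return the 12-character patient-level prefix of a TCGA barcode."""
    if not barcode.startswith("TCGA-") or len(barcode) < 12:
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return barcode[:12]

# Illustrative longer barcode; everything after the patient ID is invented.
print(to_patient_id("TCGA-A1-A0SD-01A-11R-A115-07"))  # TCGA-A1-A0SD
```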

## 10. Contact

For questions or issues regarding the dataset or preprocessing pipeline, contact the dataset authors.